[SPARK-56917][TESTS][CONNECT] Expand Connect-specific tests for DataFrame column resolution#55947
Closed
zhengruifeng wants to merge 16 commits into
Closed
[SPARK-56917][TESTS][CONNECT] Expand Connect-specific tests for DataFrame column resolution#55947zhengruifeng wants to merge 16 commits into
zhengruifeng wants to merge 16 commits into
Conversation
Contributor
Author
|
@vladimirg-db @cloud-fan would you mind taking a look? thanks |
Contributor
|
Shall we do some investments on the test framework first? It's not the first time we want test parity between classic and connect, for both scala and python APIs, but duplicating the test case is not a good long term solution. How can we define the base test cases in the base suite, and extend the suite in both classic and connect with their dedicated test cases? |
Contributor
Author
|
@cloud-fan this PR is not just for parity (thought some tests behave the same on both connect and classic). E.g. |
…esolution Add parity tests in test_connect_column.py and layered programs in test_parity_dataframe.py to lock in known Connect/Classic behavior differences in DataFrame column resolution. Tests cover shadowing, pass-through, aggregation, pivot, set ops, self-join, subquery-as-table patterns under both strict and non-strict modes of spark.sql.analyzer.strictDataFrameColumnResolution, plus three mixed-surface layered programs that combine filters, joins, aggregations, set ops, window functions, UDFs and temp views. Generated-by: Claude Code (Anthropic), claude-opus-4-7
…sic baselines Move the focused parity tests and the layered programs into a dedicated SparkConnectColumnResolutionTests class in test_connect_column.py. Each test now starts with a Classic baseline against self.spark to document how Spark Classic handles the same pattern, followed by Connect strict and Connect lenient blocks. test_parity_dataframe.py is reverted to its prior state. Generated-by: Claude Code (Anthropic), claude-opus-4-7
Replace the three shallow layered programs with Reyden-influenced
fixtures that mirror the patterns referenced in SC-229895
(reyden/query-tests/golden-files/layered-query-tests):
- 4-level subquery chain with windows, HAVING and correlated EXISTS
- CTE chain with GROUPING SETS, NTILE, struct field access and
correlated IN
- Self-join via SQL with windowed running totals, correlated EXISTS
and UDF wrapping
Each program builds the deeply layered base via spark.sql(), then
layers DataFrame-API shadowing on top with a tagged df["c"] reference
at the outermost select. Classic and Connect strict raise; Connect
lenient succeeds via name-based fallback.
Generated-by: Claude Code (Anthropic), claude-opus-4-7
Generated-by: Claude Code (Anthropic), claude-opus-4-7
The Connect-specific AnalysisException subclasses the base one, so a single import covers both. The assertRaisesRegex regex still pins the Connect-specific CANNOT_RESOLVE_DATAFRAME_COLUMN error class. Generated-by: Claude Code (Anthropic), claude-opus-4-7
…y assertions - Rewrite the three mixed-surface layered tests to use only the DataFrame API (chained transformations, semi-joins, Window functions, cube, UDFs and struct field access). spark.sql() is no longer used in the layered tests. - After running the suite locally, correct several assertions that assumed Classic-vs-Connect divergence where none exists: groupBy, intersect, pivot, and temp-view-roundtrip all preserve attribute-id propagation through Connect's plan-id resolver, so the tagged reference resolves in both strict and lenient modes. - For union the divergence does hold: Classic succeeds but Connect raises CANNOT_RESOLVE_DATAFRAME_COLUMN in both strict and lenient modes (the name-based fallback is not triggered for set-op outputs). Generated-by: Claude Code (Anthropic), claude-opus-4-7
Rename df / df1 / df2 variables that refer to Connect DataFrames in SparkConnectColumnResolutionTests to cdf / cdf1 / cdf2, matching the existing sdf convention for Classic. Drop the per-test pyspark.sql.types imports from the layered tests; the needed types are already imported at the top of the file. Generated-by: Claude Code (Anthropic), claude-opus-4-7
Based on the resolution logic in sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ColumnResolutionHelper.scala add four tests covering behaviors not previously exercised: - test_resolve_cross_dataframe_illegal_reference: the documented df1.select(df2.a) case where df2's plan id is not in df1's plan tree; fails with CANNOT_RESOLVE_DATAFRAME_COLUMN before any name-based fallback can run. - test_resolve_df_star: plan-id-tagged star expansion via UnresolvedDataFrameStar; succeeds in both Classic and Connect modes. - test_resolve_self_join_withcolumnrenamed: the documented self-join disambiguation example where the right-side candidate is filtered out by the rename projection above it. - test_resolve_sort_missing_attr_recovery: the documented df.select(df.v).sort(df.id) case where Sort's reference is recovered by resolveExprsAndAddMissingAttrs adding id back to the upstream projection, in both strict and lenient modes. Tighten comments on two existing tests: - test_resolve_after_union: cite that Union is treated as a leaf node when walking the plan tree for plan-id resolution. - test_resolve_self_join_alias: cite AMBIGUOUS_COLUMN_REFERENCE as the specific failure mode. Generated-by: Claude Code (Anthropic), claude-opus-4-7
Reframe the 3 layered DataFrame tests around ``layered[col]`` usage in filter and select at the outermost surface (no synthetic shadow on top), and move them into the shared ``ColumnTestsMixin`` so they run under Classic (``ColumnTests``), Connect strict (``ColumnParityTests``), and Connect lenient (``ColumnParityTestsWithNonStrictDFColResolution``) via the existing parity framework instead of building separate Classic and Connect arms in one test body. Generated-by: Claude Code (Anthropic), claude-opus-4-7
Reference DataFrame columns via the ``df.col_name`` getattr form instead of ``sf.col(...)`` / ``df["col"]`` in the three layered tests, binding the post-agg intermediates to variables so a DataFrame is in scope for the HAVING-style filters. Struct subfields keep bracket access because ``detail.name`` would resolve to ``Column.name``. Generated-by: Claude Code (Anthropic), claude-opus-4-8
Move the tagged DataFrame column-resolution tests out of the Connect-only SparkConnectColumnResolutionTests and into the shared ColumnTestsMixin, so each runs under Classic (ColumnTests), Connect strict (ColumnParityTests), and Connect lenient (ColumnParityTestsWithNonStrictDFColResolution). The mixin body asserts the behavior shared by Classic and Connect strict. The diverging cases are overridden in the parity suites: - union: ColumnParityTests asserts Connect raises CANNOT_RESOLVE_DATAFRAME_COLUMN in both modes (Classic resolves via attribute-id propagation in the mixin); - the shadowing trio: the lenient suite asserts name-based fallback succeeds. Tagged references use the df.col getattr form; df["*"] keeps bracket access and untagged transform expressions keep sf.col. Drops the now-unused SparkConnectColumnResolutionTests and its AnalysisException import. Generated-by: Claude Code (Anthropic), claude-opus-4-8
ad6758b to
29f715b
Compare
Move ``from pyspark.sql.window import Window`` out of the three layered test methods to the module-level imports. Add a comment on each shadowing resolution test noting that Connect lenient mode diverges (name-based fallback succeeds, overridden in the lenient parity suite). Generated-by: Claude Code (Anthropic), claude-opus-4-8
df.c on a DataFrame with only column `x` raises PySparkAttributeError before reaching the analyzer. Use df["c"] so the tagged UnresolvedAttribute is built eagerly and the AnalysisException is thrown at analysis time, which is what this test pins. Same fix applied in the lenient parity override. Generated-by: Claude Code (Anthropic), claude-opus-4-8
SELECT 1 AS x gave a df with no column c, so df.c raised
PySparkAttributeError before Spark analysis. Change the source to
SELECT 1 AS c and shadow via agg(sum("c").alias("c")) so df.c is valid
and the tagged reference is aggregated away - same semantics, correct form.
Generated-by: Claude Code (Anthropic), claude-opus-4-8
Both Union and Intersect preserve the left child's exprId (Union/Intersect.mergeChildOutputs), so neither "emits new attribute ids". The actual Connect divergence is that Union is treated as a leaf during plan-id resolution (ColumnResolutionHelper) while Intersect is not. Co-authored-by: Isaac
…L) group CUBE produces subtotal groups; ROLLUP is a distinct grouping operator, so 'subtotal' describes the (category, status, NULL) group more precisely. Co-authored-by: Isaac
Contributor
|
thanks, merging to master/4.x/4.2 (test only) |
cloud-fan
added a commit
that referenced
this pull request
May 31, 2026
…ame column resolution ### What changes were proposed in this pull request? This PR widens test coverage of how a tagged DataFrame column reference (`df.col` / `df["col"]`, which carries the source DataFrame's plan id) resolves after a range of operators, pinning the behavior across Spark Classic and both modes of `spark.sql.analyzer.strictDataFrameColumnResolution` on Spark Connect. The new tests are added to the shared `ColumnTestsMixin` in `python/pyspark/sql/tests/test_column.py`, so each one runs in three environments: - `ColumnTests` - Classic (default `strictDataFrameColumnResolution=true`), - `ColumnParityTests` - Connect strict, - `ColumnParityTestsWithNonStrictDFColResolution` - Connect lenient (extends the strict suite). The base mixin asserts the behavior shared by all three. The few cases where Connect diverges are overridden in the Connect parity suites (`python/pyspark/sql/tests/connect/test_parity_column.py`) rather than duplicated, so the shared cases stay mode-agnostic. This follows the base-suite + per-mode-override structure suggested in review. **Per-operator resolution tests** (added to `ColumnTestsMixin`): - Pass-through (`filter`, `sort`, `distinct`) - all modes resolve. - Attribute-id propagation (`groupBy().count()`, `pivot`, `intersect`, temp-view roundtrip) - all modes resolve; the source attribute id flows through. - Removal (`withColumnRenamed`, `drop`) - all modes raise (column gone by id and by name). - Shadowing (`withColumn` chain, `select` + alias, `agg` + alias) - Classic and Connect strict raise; **Connect lenient** resolves the shadowed name via name-based fallback (overridden in the lenient suite). - Union - Classic resolves the left-side reference (Union keeps the left child's attribute ids); **Connect** raises `CANNOT_RESOLVE_DATAFRAME_COLUMN` in both modes, because Union is treated as a leaf during plan-id resolution (overridden in the strict suite, inherited by lenient). - Self-join - the alias form is ambiguous and raises in all modes; the documented `withColumnRenamed` form resolves in all modes (the renamed side is filtered out during disambiguation). - Cross-DataFrame illegal reference (`df1.select(df2.col)`) - raises in all modes; the throw is not gated by the strict/lenient switch. - `df["*"]` star expansion and sort-missing-attribute recovery - resolve in all modes. **Mixed-surface layered programs** (3): each chains 4-5 transformations - semi-joins (DataFrame-API EXISTS/IN), window functions, cube aggregation, NTILE, UDFs, struct-field access - and then references the outermost layer's columns in both `filter` and `select`. These exercise plan-id propagation across interacting analyzer rules, which single-operator tests miss. All modes resolve. ### Why are the changes needed? #55531 added the `strictDataFrameColumnResolution` config and a single shadowing test. This PR enumerates the Connect-vs-Classic resolution behavior across shadowing variants, attribute-id-propagating operators, set operations, self-joins, and multi-operator layered pipelines. A future tightening of Connect's column resolution will then surface as a clear test failure rather than a silent regression. ### Does this PR introduce _any_ user-facing change? No. Test-only change. ### How was this patch tested? New tests, run locally across all three suites: ``` python/run-tests --testnames "pyspark.sql.tests.test_column ColumnTests" python/run-tests --testnames "pyspark.sql.tests.connect.test_parity_column ColumnParityTests" python/run-tests --testnames "pyspark.sql.tests.connect.test_parity_column ColumnParityTestsWithNonStrictDFColResolution" ``` ### Was this patch authored or co-authored using generative AI tooling? Generated-by: Claude Code (Anthropic), claude-opus-4-7 Closes #55947 from zhengruifeng/SC-229895-connect-col-tests. Lead-authored-by: Ruifeng Zheng <ruifengz@apache.org> Co-authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 37fcee4) Signed-off-by: Wenchen Fan <wenchen@databricks.com>
cloud-fan
added a commit
that referenced
this pull request
May 31, 2026
…ame column resolution ### What changes were proposed in this pull request? This PR widens test coverage of how a tagged DataFrame column reference (`df.col` / `df["col"]`, which carries the source DataFrame's plan id) resolves after a range of operators, pinning the behavior across Spark Classic and both modes of `spark.sql.analyzer.strictDataFrameColumnResolution` on Spark Connect. The new tests are added to the shared `ColumnTestsMixin` in `python/pyspark/sql/tests/test_column.py`, so each one runs in three environments: - `ColumnTests` - Classic (default `strictDataFrameColumnResolution=true`), - `ColumnParityTests` - Connect strict, - `ColumnParityTestsWithNonStrictDFColResolution` - Connect lenient (extends the strict suite). The base mixin asserts the behavior shared by all three. The few cases where Connect diverges are overridden in the Connect parity suites (`python/pyspark/sql/tests/connect/test_parity_column.py`) rather than duplicated, so the shared cases stay mode-agnostic. This follows the base-suite + per-mode-override structure suggested in review. **Per-operator resolution tests** (added to `ColumnTestsMixin`): - Pass-through (`filter`, `sort`, `distinct`) - all modes resolve. - Attribute-id propagation (`groupBy().count()`, `pivot`, `intersect`, temp-view roundtrip) - all modes resolve; the source attribute id flows through. - Removal (`withColumnRenamed`, `drop`) - all modes raise (column gone by id and by name). - Shadowing (`withColumn` chain, `select` + alias, `agg` + alias) - Classic and Connect strict raise; **Connect lenient** resolves the shadowed name via name-based fallback (overridden in the lenient suite). - Union - Classic resolves the left-side reference (Union keeps the left child's attribute ids); **Connect** raises `CANNOT_RESOLVE_DATAFRAME_COLUMN` in both modes, because Union is treated as a leaf during plan-id resolution (overridden in the strict suite, inherited by lenient). - Self-join - the alias form is ambiguous and raises in all modes; the documented `withColumnRenamed` form resolves in all modes (the renamed side is filtered out during disambiguation). - Cross-DataFrame illegal reference (`df1.select(df2.col)`) - raises in all modes; the throw is not gated by the strict/lenient switch. - `df["*"]` star expansion and sort-missing-attribute recovery - resolve in all modes. **Mixed-surface layered programs** (3): each chains 4-5 transformations - semi-joins (DataFrame-API EXISTS/IN), window functions, cube aggregation, NTILE, UDFs, struct-field access - and then references the outermost layer's columns in both `filter` and `select`. These exercise plan-id propagation across interacting analyzer rules, which single-operator tests miss. All modes resolve. ### Why are the changes needed? #55531 added the `strictDataFrameColumnResolution` config and a single shadowing test. This PR enumerates the Connect-vs-Classic resolution behavior across shadowing variants, attribute-id-propagating operators, set operations, self-joins, and multi-operator layered pipelines. A future tightening of Connect's column resolution will then surface as a clear test failure rather than a silent regression. ### Does this PR introduce _any_ user-facing change? No. Test-only change. ### How was this patch tested? New tests, run locally across all three suites: ``` python/run-tests --testnames "pyspark.sql.tests.test_column ColumnTests" python/run-tests --testnames "pyspark.sql.tests.connect.test_parity_column ColumnParityTests" python/run-tests --testnames "pyspark.sql.tests.connect.test_parity_column ColumnParityTestsWithNonStrictDFColResolution" ``` ### Was this patch authored or co-authored using generative AI tooling? Generated-by: Claude Code (Anthropic), claude-opus-4-7 Closes #55947 from zhengruifeng/SC-229895-connect-col-tests. Lead-authored-by: Ruifeng Zheng <ruifengz@apache.org> Co-authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 37fcee4) Signed-off-by: Wenchen Fan <wenchen@databricks.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What changes were proposed in this pull request?
This PR widens test coverage of how a tagged DataFrame column reference (
df.col/df["col"], which carries the source DataFrame's plan id) resolves after a range of operators, pinning the behavior across Spark Classic and both modes ofspark.sql.analyzer.strictDataFrameColumnResolutionon Spark Connect.The new tests are added to the shared
ColumnTestsMixininpython/pyspark/sql/tests/test_column.py, so each one runs in three environments:ColumnTests- Classic (defaultstrictDataFrameColumnResolution=true),ColumnParityTests- Connect strict,ColumnParityTestsWithNonStrictDFColResolution- Connect lenient (extends the strict suite).The base mixin asserts the behavior shared by all three. The few cases where Connect diverges are overridden in the Connect parity suites (
python/pyspark/sql/tests/connect/test_parity_column.py) rather than duplicated, so the shared cases stay mode-agnostic. This follows the base-suite + per-mode-override structure suggested in review.Per-operator resolution tests (added to
ColumnTestsMixin):filter,sort,distinct) - all modes resolve.groupBy().count(),pivot,intersect, temp-view roundtrip) - all modes resolve; the source attribute id flows through.withColumnRenamed,drop) - all modes raise (column gone by id and by name).withColumnchain,select+ alias,agg+ alias) - Classic and Connect strict raise; Connect lenient resolves the shadowed name via name-based fallback (overridden in the lenient suite).CANNOT_RESOLVE_DATAFRAME_COLUMNin both modes, because Union is treated as a leaf during plan-id resolution (overridden in the strict suite, inherited by lenient).withColumnRenamedform resolves in all modes (the renamed side is filtered out during disambiguation).df1.select(df2.col)) - raises in all modes; the throw is not gated by the strict/lenient switch.df["*"]star expansion and sort-missing-attribute recovery - resolve in all modes.Mixed-surface layered programs (3): each chains 4-5 transformations - semi-joins (DataFrame-API EXISTS/IN), window functions, cube aggregation, NTILE, UDFs, struct-field access - and then references the outermost layer's columns in both
filterandselect. These exercise plan-id propagation across interacting analyzer rules, which single-operator tests miss. All modes resolve.Why are the changes needed?
#55531 added the
strictDataFrameColumnResolutionconfig and a single shadowing test. This PR enumerates the Connect-vs-Classic resolution behavior across shadowing variants, attribute-id-propagating operators, set operations, self-joins, and multi-operator layered pipelines. A future tightening of Connect's column resolution will then surface as a clear test failure rather than a silent regression.Does this PR introduce any user-facing change?
No. Test-only change.
How was this patch tested?
New tests, run locally across all three suites:
Was this patch authored or co-authored using generative AI tooling?
Generated-by: Claude Code (Anthropic), claude-opus-4-7